Although movies are mainly made for entertainment, not earning enough profit puts filmmakers in an awkward situation that makes it difficult to keep producing high-quality movies. One of the most popular quality metrics is the score from the Internet Movie Database (IMDb). Based on the metadata from IMDb, it is interesting to analyze what makes one movie more successful than another, both commercially and critically. Therefore, the main goal of this project is to explore the IMDb dataset with a focus on profit and IMDb score, and to present the findings in an intuitive and interactive way.
These are the R packages required for this project.
library(tidyverse)
library(knitr)
library(plotly)
library(ggrepel)
library(DT)
library(tm)
library(openxlsx)
The dataset used in this project is the IMDb 5000 Movie Dataset from Kaggle. It records information on more than 5000 movies across 66 countries from 1916 to 2016. The dataset is available as a csv file of about 1 MB. Note that the original dataset has been replaced on the Kaggle website, so the original version can no longer be accessed. We accessed the data at the following link: https://www.kaggle.com/carolzhangdc/imdb-5000-movie-dataset (Yueming Zhang 2017).
The data preparation part consists of the following tasks:
First, we imported the data and checked the dimension and names of all attributes of the data. As shown below, there are 5043 movies recorded, and each record has 28 attributes, including information such as “Movie’s Title”, “Director’s Name”, “Budget of the movie”, and “IMDb score of the movie”.
## [1] 5043 28
## [1] "color" "director_name"
## [3] "num_critic_for_reviews" "duration"
## [5] "director_facebook_likes" "actor_3_facebook_likes"
## [7] "actor_2_name" "actor_1_facebook_likes"
## [9] "gross" "genres"
## [11] "actor_1_name" "movie_title"
## [13] "num_voted_users" "cast_total_facebook_likes"
## [15] "actor_3_name" "facenumber_in_poster"
## [17] "plot_keywords" "movie_imdb_link"
## [19] "num_user_for_reviews" "language"
## [21] "country" "content_rating"
## [23] "budget" "title_year"
## [25] "actor_2_facebook_likes" "imdb_score"
## [27] "aspect_ratio" "movie_facebook_likes"
Here is the preview of the raw data.
Then, to get the data ready for analysis, some tidying and cleaning needed to be done. First, unnecessary characters in the movie_title, genres, and plot_keywords columns were removed.
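The report does not say which characters were removed, so the following is only a minimal sketch under two assumptions: that titles carry stray surrounding whitespace, and that the "|" separators in genres and plot_keywords are replaced by spaces. The toy rows are illustrative; the genres_new and plot_keywords_new names match the columns listed in the cleaned data.

```r
# Toy rows standing in for the raw data (values are illustrative only).
movies <- data.frame(
  movie_title   = c("Avatar ", " Spectre  "),
  genres        = c("Action|Adventure|Sci-Fi", "Action|Thriller"),
  plot_keywords = c("alien|future|marine", "bomb|spy"),
  stringsAsFactors = FALSE
)

# Assumption: the unnecessary characters in titles are surrounding whitespace.
movies$movie_title <- trimws(movies$movie_title)

# Assumption: the "|" separators become spaces in the edited columns.
movies$genres_new        <- gsub("|", " ", movies$genres, fixed = TRUE)
movies$plot_keywords_new <- gsub("|", " ", movies$plot_keywords, fixed = TRUE)
```

Using `fixed = TRUE` matters here because "|" is the alternation operator in a regular expression and would otherwise not be matched literally.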
Second, duplicates in the movie_title column were removed, as they may affect later analysis. In total, 126 duplicate movies were removed.
## [1] 126
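One way to count and drop duplicate titles is sketched below in base R. The toy data frame is hypothetical, and keeping the first occurrence of each title is an assumption about how the duplicates were resolved.

```r
# Hypothetical example rows; "King Kong" appears twice.
movies <- data.frame(
  movie_title = c("King Kong", "Ben-Hur", "King Kong", "Pan"),
  imdb_score  = c(7.2, 6.1, 7.2, 5.8),
  stringsAsFactors = FALSE
)

# Number of rows whose title repeats an earlier row.
n_dup <- sum(duplicated(movies$movie_title))

# Keep only the first occurrence of each title (assumed tie-breaking rule).
movies_unique <- movies[!duplicated(movies$movie_title), ]
```

Deduplicating on the title alone also drops distinct films that happen to share a name, such as remakes, so a stricter key (for example title plus title_year) may be preferable in some analyses.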
Third, columns that contain currency values, such as budget and gross, may cause problems in later analysis because the amounts for movies from a few countries, such as “South Korea” and “Japan”, were not converted to US dollars. Furthermore, even if all of them were converted to US dollars, we would still need to account for inflation, which makes the problem even more complicated. Thus, only movies from the USA were kept for the profitability analysis.
## # A tibble: 4,917 x 4
## movie_title budget country gross
## <chr> <dbl> <chr> <int>
## 1 Lady Vengeance 4200000000 South Korea 211667
## 2 Fateless 2500000000 Hungary 195888
## 3 Princess Mononoke 2400000000 Japan 2298191
## 4 Steamboy 2127519898 Japan 410388
## 5 Akira 1100000000 Japan 439162
## 6 Godzilla 2000 1000000000 Japan 10037390
## 7 Kabhi Alvida Naa Kehna 700000000 India 3275443
## 8 Tango 700000000 Spain 1687311
## 9 Kites 600000000 India 1602466
## 10 Red Cliff 553632000 China 626809
## # ... with 4,907 more rows
Then, a new column profitable was added to indicate whether a movie is profitable: ‘1’ means profitable (that is, gross \(>\) budget, or equivalently profit \(>\) 0). As this involves both the gross and budget columns, only movies from the USA have a non-empty value for this column. There are 3711 USA films, each of which has 31 attributes. Note that this version of the data was saved to a csv file for people who want to focus only on USA films.
## [1] 3711 31
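The USA subset and the profitable flag could be built roughly as follows. This is a sketch on hypothetical rows, assuming a film counts as profitable when its gross exceeds its budget; films with a missing gross or budget get `NA`.

```r
# Hypothetical rows; budgets outside the USA may be in local currency.
movies <- data.frame(
  movie_title = c("A", "B", "C", "D"),
  country     = c("USA", "USA", "Japan", "USA"),
  budget      = c(100e6, 50e6, 2.4e9, 20e6),
  gross       = c(250e6, 30e6, 2.3e6, NA),
  stringsAsFactors = FALSE
)

# Keep only USA films; %in% treats a missing country as "not USA".
usa <- movies[movies$country %in% "USA", ]

# Assumption: profitable means gross > budget; NA propagates when either is missing.
usa$profitable <- as.integer(usa$gross > usa$budget)
```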
Last, regarding records that contain missing values, to keep the entire dataset as complete as possible, we decided to not remove any rows with missing data and to handle this issue for each individual analysis. For example, when doing genre-wise analysis, those without values for genre variables are excluded from the analysis.
After the cleaning, 4917 records remain, each with 30 attributes; these were saved to a csv file for future use.
## [1] 4917 30
## [1] "color" "director_name"
## [3] "num_critic_for_reviews" "duration"
## [5] "director_facebook_likes" "actor_3_facebook_likes"
## [7] "actor_2_name" "actor_1_facebook_likes"
## [9] "gross" "genres"
## [11] "actor_1_name" "movie_title"
## [13] "num_voted_users" "cast_total_facebook_likes"
## [15] "actor_3_name" "facenumber_in_poster"
## [17] "plot_keywords" "movie_imdb_link"
## [19] "num_user_for_reviews" "language"
## [21] "country" "content_rating"
## [23] "budget" "title_year"
## [25] "actor_2_facebook_likes" "imdb_score"
## [27] "aspect_ratio" "movie_facebook_likes"
## [29] "genres_new" "plot_keywords_new"
Additionally, 3700 of the 4917 records have no missing values. They were also saved to a csv file.
## [1] 3700
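Counting and extracting fully populated records can be done with `complete.cases()`. The sketch below uses a hypothetical three-row data frame.

```r
# Hypothetical rows with some missing values.
movies <- data.frame(
  movie_title = c("A", "B", "C"),
  gross       = c(1e6, NA, 3e6),
  budget      = c(2e6, 5e6, NA),
  stringsAsFactors = FALSE
)

# TRUE for rows in which no column is NA.
is_complete <- complete.cases(movies)

n_complete      <- sum(is_complete)
movies_complete <- movies[is_complete, ]
```

Note that `complete.cases()` only detects `NA`; empty strings in character columns count as present unless they were read in as `NA` (for example via the `na.strings` argument of `read.csv()`).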
The following table gives the name, type, and description of each variable in the dataset.
| Name | Type | Description |
|---|---|---|
| color | character | Colorization: Color or Black and White |
| director_name | character | Name of the director |
| num_critic_for_reviews | integer | Number of Critical Reviews |
| duration | integer | Duration of the movie in Minutes |
| director_facebook_likes | integer | Number of FB Page Likes of Director |
| actor_3_facebook_likes | integer | Number of FB Page Likes of Actor No.3 |
| actor_2_name | character | Name of Actor No.2 |
| actor_1_facebook_likes | integer | Number of FB Page Likes of Actor No.1 |
| gross | integer | Gross Earned in US Dollars |
| genres | character | Classification: Action, Comedy, Drama, …, etc. |
| actor_1_name | character | Name of Actor No.1 |
| movie_title | character | Title of the Movie |
| num_voted_users | integer | Number of Voted Users on IMDB |
| cast_total_facebook_likes | integer | Total FB Page Likes of the Entire Cast |
| actor_3_name | character | Name of Actor No.3 |
| facenumber_in_poster | integer | Number of the Actors Featured in the Movie Poster |
| plot_keywords | character | Keywords Describing the Plot |
| movie_imdb_link | character | IMDB Link of the Movie |
| num_user_for_reviews | integer | Number of Users who Reviewed the Movie |
| language | character | Language of the movie: English, French, Chinese, …, etc. |
| country | character | Country where the Movie was Produced |
| content_rating | character | Content rating |
| budget | double | Budget in US Dollars |
| title_year | integer | Year of Release |
| actor_2_facebook_likes | integer | Number of FB Page Likes of Actor No.2 |
| imdb_score | double | IMDB Score on a Scale of 1 to 10 |
| aspect_ratio | double | Aspect Ratio |
| movie_facebook_likes | integer | Number of FB Page Likes of the Film |
| genres_new | character | Edited genres |
| plot_keywords_new | character | Edited plot_keywords |
The analysis is focused on four main aspects: genre, country, IMDb score, and how they relate to profitability.
In the first part of the analysis, the focus is on film genres and their relationship with profitability. Since most movies in this dataset are assigned multiple genres, some preprocessing of the genres must be done before actually analyzing the data.
First, a document-term matrix for genres was constructed using the package “tm”.
## <<DocumentTermMatrix (documents: 4917, terms: 26)>>
## Non-/sparse entries: 14127/113715
## Sparsity : 89%
## Maximal term length: 11
## Weighting : term frequency (tf)
Then, the created document-term matrix was used to calculate frequency for each genre.
## # A tibble: 26 x 2
## genre count
## <chr> <dbl>
## 1 drama 2533
## 2 comedy 1847
## 3 thriller 1364
## 4 action 1113
## 5 romance 1084
## 6 adventure 888
## 7 crime 868
## 8 sci-fi 594
## 9 fantasy 583
## 10 horror 539
## # ... with 16 more rows
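The same per-genre counts can be reproduced without a document-term matrix. The following base-R sketch is a swapped-in alternative to the tm approach, not the code used in this report: it splits each genres string on "|" and tabulates the resulting tokens. The input strings are hypothetical.

```r
# Hypothetical genres strings in the raw "|"-separated form.
genres <- c("Action|Adventure|Sci-Fi", "Drama",
            "Comedy|Drama|Romance", "Drama|Thriller")

# Split every string into its genres and lower-case them.
tokens <- tolower(unlist(strsplit(genres, "|", fixed = TRUE)))

# Tabulate and sort from most to least frequent.
counts <- sort(table(tokens), decreasing = TRUE)

genre_freq <- data.frame(
  genre = names(counts),
  count = as.integer(counts),
  stringsAsFactors = FALSE
)
```

Each movie contributes one count to every genre it lists, which is the same quantity as the column sums of a term-frequency document-term matrix.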
After preprocessing, we first looked for the most frequently used genres. Plotting the distribution of genre frequencies shows that the top 5 movie genres are “Drama”, “Comedy”, “Thriller”, “Action”, and “Romance”.
Next, we want to identify which genres tend to have higher profitability as well as higher ratings. As mentioned before, this part involves only movies from the USA due to the currency conversion issue.
First, we calculated the average gross, budget, profit (= gross - budget), and IMDb score for each genre.
## # A tibble: 23 x 5
## genres_new mean_gross mean_budget mean_profit mean_imdb
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Action 85881165. 70503681. 15377484. 6.24
## 2 Adventure 107588235. 80902013. 26686222. 6.41
## 3 Animation 120847003. 85607853. 35239151. 6.64
## 4 Biography 46089398. 29135662. 16953735. 7.10
## 5 Comedy 54540529. 35059454. 19481075. 6.14
## 6 Crime 44945166. 33948142. 10997025. 6.50
## 7 Documentary 14683626. 4021501. 10662126. 6.84
## 8 Drama 42460107. 29585071. 12875036. 6.72
## 9 Family 94310623. 64866065. 29444559. 6.19
## 10 Fantasy 90539880. 66073493. 24466387. 6.23
## # ... with 13 more rows
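A per-genre summary like the one above can be sketched in base R by first expanding each film into one row per genre and then averaging. The data are hypothetical, and counting a film once under each of its genres is an assumption about how multi-genre films were handled.

```r
# Hypothetical USA films (amounts in millions for readability).
usa <- data.frame(
  genres     = c("Action|Adventure", "Drama", "Action|Drama"),
  gross      = c(200, 40, 90),
  budget     = c(150, 30, 60),
  imdb_score = c(6.0, 7.0, 6.5),
  stringsAsFactors = FALSE
)
usa$profit <- usa$gross - usa$budget

# One row per (film, genre) pair.
parts <- strsplit(usa$genres, "|", fixed = TRUE)
long  <- usa[rep(seq_len(nrow(usa)), lengths(parts)),
             c("gross", "budget", "profit", "imdb_score")]
long$genre <- unlist(parts)

# Mean of each measure within each genre, sorted by mean profit (descending).
by_genre <- aggregate(cbind(gross, budget, profit, imdb_score) ~ genre,
                      data = long, FUN = mean)
by_genre <- by_genre[order(-by_genre$profit), ]
```

The formula interface of `aggregate()` drops rows with a missing value in any of the listed columns, so films lacking gross or budget are excluded from all four means.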
Then, we compared them using a grouped bar chart sorted in descending order of mean profit. Here is what we found:
Last, we analyzed a film’s success based on its IMDb score.
First, to understand the distribution of scores across all films, we plotted a histogram of IMDb scores together with several summary statistics. The mean IMDb score is about 6.5, the median is 6.6, the lowest score is 1.6, and the highest score is 9.3. Furthermore, by computing the 10th and 90th percentiles, we can see that 80% of movies have a score between 5.1 and 7.7. For a movie to be in the top \(10\%\) of this dataset, it needs a score of at least 7.7.
## imdb_score
## Min. :1.600
## 1st Qu.:5.900
## Median :6.600
## Mean :6.464
## 3rd Qu.:7.200
## Max. :9.300
## 10% 90%
## 5.1 7.7
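The summary statistics and percentiles above come from standard base-R calls. A minimal sketch on a hypothetical score vector:

```r
# Hypothetical IMDb scores.
imdb_score <- c(1.6, 5.1, 5.9, 6.6, 6.6, 7.2, 7.7, 9.3)

# Min, quartiles, median, mean, and max in one call.
score_summary <- summary(imdb_score)

# 10th and 90th percentiles; the middle 80% of films lie between them.
q <- quantile(imdb_score, probs = c(0.10, 0.90), na.rm = TRUE)

# Share of films whose score falls inside that interval.
share_inside <- mean(imdb_score >= q[["10%"]] & imdb_score <= q[["90%"]])
```

With only a handful of values the share is not exactly 80%, because `quantile()` interpolates between observations; on a few thousand films it will be close to it.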
Here is the IMDb score distribution.
Top 10 movies with highest IMDb score
Top 10 directors with highest average IMDb score
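A ranking of directors by average score can be computed along these lines. The director names and scores are made up, and no minimum number of films per director is imposed, which is an assumption; with such a filter, one-film directors would not dominate the list.

```r
# Hypothetical films and directors.
movies <- data.frame(
  director_name = c("Ann", "Ann", "Bob", "Cy", "Cy"),
  imdb_score    = c(8.0, 9.0, 7.0, 6.0, 8.0),
  stringsAsFactors = FALSE
)

# Mean IMDb score per director.
by_director <- aggregate(imdb_score ~ director_name, data = movies, FUN = mean)

# Highest averages first; head() keeps at most the top 10.
top_directors <- head(by_director[order(-by_director$imdb_score), ], 10)
```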
To understand the relationship between IMDb score, gross, and budget, we plotted a 3D scatter plot using the package “plotly” to get an overall picture. This part, again, involves only films from the USA. Note that a green point marks a profitable film while a blue one marks an unprofitable film, and that a bigger point means greater profit. Since this is an interactive plot, we can rotate the graph to examine the relationship between any two of the three variables.
From the plot, we can see that movies with higher IMDb scores tend to have higher profit, and that a significant number of movies ended up losing money. This aligns with our intuition: IMDb score and gross might be correlated, as people are more willing to watch famous and highly rated movies. We can also observe that a bigger budget does not guarantee quality.
However, since people may be more interested in films with huge commercial success, the top \(1\%\) (\(\sim 40\)) most profitable films were plotted by gross and IMDb score. Note that the vertical and horizontal lines mark the median gross and the median IMDb score, and a bigger point means higher profit. Taking a closer look at the relationship between the gross of these films and their IMDb ratings, we see little correlation. This is as expected, since highly rated films generally do not do especially well at the box office. However, all but two of these 40 films have a score of at least 6.4.
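Selecting the most profitable \(1\%\) of films and the reference medians might look like the sketch below. The 200 films are synthetic, and computing the medians over the selected films (rather than over all USA films) is an assumption, since the text does not specify which set they refer to.

```r
# Synthetic data: 200 films with increasing gross and budget.
usa <- data.frame(
  movie_title = paste("Film", 1:200),
  gross       = (1:200) * 2e6,
  budget      = (1:200) * 1e6,
  imdb_score  = rep(c(5.5, 6.5, 7.5, 8.5), times = 50),
  stringsAsFactors = FALSE
)
usa$profit <- usa$gross - usa$budget

# Number of films in the top 1%, rounded up.
n_top <- ceiling(nrow(usa) / 100)

# Films with the highest profit.
top_profit <- head(usa[order(-usa$profit), ], n_top)

# Reference lines for the plot (assumed to be medians of the selected films).
median_gross <- median(top_profit$gross, na.rm = TRUE)
median_score <- median(top_profit$imdb_score, na.rm = TRUE)
```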
Then, these top \(1\%\) movies were plotted again, this time showing profit against budget, where a bigger point means a higher IMDb score. For movies with budgets over 70 million dollars, we can observe an upward, nearly linear trend, which suggests that bigger-budget movies tend to earn more profit. There is, however, a slightly downward trend when the budget is less than 70 million dollars. We found that these movies were mostly produced between the late 1970s and the early 1990s, such as “E.T. the Extra-Terrestrial” and “Star Wars: Episode IV - A New Hope”, so their budgets would be higher in today's dollars once inflation is taken into account. Therefore, we believe that, after adjusting for inflation, this graph would show a more consistently increasing trend.
However, the profit earned does not give the whole picture of a movie's financial success across different years, so “Return on Investment (ROI)” is used to provide a different perspective on a movie’s profitability. The following graph shows the top \(1\%\) of movies by ROI among those with a budget of at least 10 million dollars. As expected, films with smaller budgets have higher ROI, and ROI decreases as the budget grows. Yet the ROIs of movies with budgets over \(\$20M\) do not differ much. Also, from this graph and the previous one, we can see that “Star Wars: Episode IV - A New Hope” and “E.T. the Extra-Terrestrial” are two outstandingly successful films in terms of IMDb score, profit, and ROI.
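The report does not state its ROI formula, so the sketch below assumes the common definition ROI = (gross - budget) / budget. The films are hypothetical; the 10 million dollar budget floor is the one mentioned in the text.

```r
# Hypothetical films.
usa <- data.frame(
  movie_title = c("Small hit", "Blockbuster", "Flop", "Micro-budget"),
  budget      = c(11e6, 200e6, 50e6, 1e6),
  gross       = c(110e6, 600e6, 25e6, 50e6),
  stringsAsFactors = FALSE
)

# Assumption: ROI is profit divided by budget.
usa$roi <- (usa$gross - usa$budget) / usa$budget

# Keep films with a budget of at least 10 million dollars, as in the text.
eligible <- usa[!is.na(usa$budget) & usa$budget >= 10e6, ]

# Rank by ROI, highest first; the top 1% is then the head of this table.
ranked <- eligible[order(-eligible$roi), ]
n_top  <- ceiling(nrow(ranked) / 100)
top_roi <- head(ranked, n_top)
```

The budget floor keeps very small productions from dominating: the micro-budget film here has by far the highest ROI but is excluded before ranking.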
Yueming Zhang. 2017. IMDB 5000 Movie Dataset | Kaggle. https://www.kaggle.com/carolzhangdc/imdb-5000-movie-dataset.